Exploring the PDO text of World Bank’s projects

TL;DR

WB
NLP
TextAnalytics
ML
DigitalHumanities
Rstats

The idea of analyzing language as data has always intrigued me. In this deep dive, I focus on ~4,000 World Bank Projects & Operations, zooming in on the short texts that describe the Project Development Objectives (PDOs)—an abstract of sorts for the Bank’s operations.
This explorative analysis revealed fascinating—and surprising—insights, uncovering patterns in text but also ways to enhance the quality of projects’ data itself.

This is an ongoing project, so comments, questions, and suggestions are welcome. The R source code is open (still in progress and not fully polished).

Published

October 29, 2024

Motivation

I have always been fascinated by the idea of analyzing language as data and I finally found some time to study Natural Language Processing (NLP) and Text Analytics techniques.

In this learning project, I explore a dataset of World Bank Projects & Operations, with a focus on the text data contained in the Project Development Objective (PDO) section of World Bank’s projects (loans, grants, technical assistance). A PDO outlines, in synthetic form, the proposed objectives of operations, as defined in the early stages of the World Bank project cycle.

Normally, a few objectives are listed in paragraphs that are a couple of sentences long. Table 1 shows two examples.

Table 1: Illustrative PDOs text in Projects’ documents
Project_ID Project_Name Project_Development_Objective
P127665 Second Economic Recovery Development Policy Loan This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate.
P179010 Tunisia Emergency Food Security Response Project To (a) ensure, in the short-term, the supply of (i) agricultural inputs for farmers to secure the next cropping seasons and for continued dairy production, and (ii) wheat for uninterrupted access to bread and other grain products for poor and vulnerable households; and (b) strengthen Tunisia’s resilience to food crises by laying the ground for reforms of the grain value chain.

The dataset also includes some relevant metadata about the projects, including: country, fiscal year of approval, project status, main sector, main theme, environmental risk category, and lending instrument.

I retrieved the data from the WBG Projects page. The World Bank classifies these data as “public”, accessible under a Creative Commons Attribution 4.0 International License.

(I made some attempts to make data ingestion automatic via API calls, but at the moment this is apparently beyond my web scraping ability 😬).

Data

The original dataset included 22,569 World Bank projects approved from fiscal year 1947 through 2025, as of August 31, 2024. Approximately half—11,322 projects—had a viable Project Development Objective (PDO) text (i.e., not blank or labeled as “TBD”, etc.), all approved after FY2001. From this group, some projects were excluded due to missing key variables.

This left 8,811 projects as usable observations for analysis.

Interestingly, within this refined subset, 2,235 projects share only 1,006 unique PDOs: recycled PDOs often appear in follow-up projects or components of a larger parent project.

Finally, from these 8,811 projects, a representative sample of 4,403 projects with PDOs was selected for further analysis.

First, it is important to notice that all 7,548 projects approved before FY2001 had no PDO text available.

The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:

  1. PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
  2. PDO text missingness: in addition to the 11,306 projects with missing PDOs, 31 projects had invalid PDO values, namely:
    • 11 have PDO as one of: “.”,“-”,“NA”, “N/A”
    • 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”,“PDO remains the same.”
    • 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
    • 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”

Of the 11,322 projects with a valid PDO, additional projects were excluded from the analysis for incompleteness:

  • 3 projects without “project status”
  • 2,176 projects without “board approval FY”
  • 332 projects approved in FY2024 or later (whose approval stage is still incomplete)

Lastly (and this was quite surprising to me), the remaining 8,811 viable unique projects were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 NON-UNIQUE PDO texts in the clean dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.

In sum, the cleaning process yielded a usable set of 8,811 projects, which was split into a training subset (4,403) to explore and test models and a testing subset (4,408), held out for post-prediction evaluation.
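A 50/50 split along these lines can be sketched in base R; the `projects` data frame below is a dummy stand-in for the actual cleaned dataset:

```r
# Reproducible 50/50 train-test split of the 8,811 usable projects (base R).
# `projects` is a dummy stand-in for the cleaned dataset.
projects <- data.frame(id = sprintf("P%06d", seq_len(8811)))

set.seed(123)                               # for reproducibility
train_idx <- sample(nrow(projects), 4403)   # indices of the training subset

train_set <- projects[train_idx, , drop = FALSE]
test_set  <- projects[-train_idx, , drop = FALSE]

nrow(train_set)   # 4403
nrow(test_set)    # 4408
```

A stratified split (e.g., by sector or approval year) would be preferable if the class proportions matter for the downstream models.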

Preprocessing the PDO text data

Cleaning text data entails extra steps compared to numerical data. A key process is tokenization, which breaks text into smaller units like words, bigrams, n-grams, or sentences. After that, a common cleaning task is normalization, where text is standardized (e.g., converting to lowercase). Similarly, data reduction techniques like stemming and lemmatization simplify words to their root form (e.g., “running,” “ran,” and “runs” become “run”). This can help to reduce dimensionality, especially with very large datasets, when the word form is not relevant.

After tokenization, it is very common to remove irrelevant elements like punctuation or stop words (unimportant words like “the”, “ii)”, “at”, or context-specific ones like “PDO”) that add noise to the data.

In contrast, data enhancement techniques like part-of-speech tagging add value by identifying grammatical components, allowing focus on meaningful elements like nouns, verbs, or adjectives.

In R, Part-of-Speech (POS) tagging can be done using the cleanNLP package, which provides a wrapper around the CoreNLP Java library. Executing these tasks is very computationally expensive. Based on random checks, the classification of POS tags in the PDO text data was not always accurate, but I considered it good enough for the purpose of this analysis.
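A minimal base-R sketch of these preprocessing steps (the sample sentence and the stop-word list are illustrative; the actual analysis presumably relies on packages such as tidytext and SnowballC):

```r
# Tokenization, normalization, and stop-word removal in base R.
pdo <- "The PDO is to strengthen the investment climate and enhance fiscal sustainability."

tokens <- tolower(pdo)                      # normalization: lowercase
tokens <- gsub("[[:punct:]]", "", tokens)   # strip punctuation
tokens <- strsplit(tokens, "\\s+")[[1]]     # tokenization: split on whitespace

stop_words <- c("the", "is", "to", "and", "pdo")  # context-specific stop words
tokens <- tokens[!tokens %in% stop_words]

tokens
# "strengthen" "investment" "climate" "enhance" "fiscal" "sustainability"
```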

Term Frequency

Figure 1 shows the most recurrent tokens and stems in the PDO text data.

Words and stems

Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800 (as they have been combined by root). Despite the pre-processing of PDOs’ text data, these aren’t particularly informative words.

Figure 1

Bigrams

Figure 2 shows the most frequent bigrams in the PDO text data. The top-ranking bigrams align with expectations, featuring phrases like “increase access”, “service delivery”, “institutional capacity”, and “poverty reduction” at the top. Notably, while “health” appears in several bigrams (e.g., “health services”, “public health”, “health care”), “education” is absent from the top 25. Another intriguing observation is the frequent mention (over 100 instances) of “eligible crisis”, which seems somewhat unexpected.

Figure 2
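Bigram counting of this kind can be sketched in base R as follows (the two sample PDOs are invented; in practice a tokenizer such as tidytext’s `unnest_tokens` would do this at scale):

```r
# Extract bigrams (adjacent word pairs) from a text, base R only.
make_bigrams <- function(text) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])   # pair each word with its successor
}

# Two invented PDO-like snippets:
pdos <- c("increase access to health services",
          "improve health services and service delivery")

bigrams <- unlist(lapply(pdos, make_bigrams))
sort(table(bigrams), decreasing = TRUE)     # "health services" tops the count
```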

Trigrams

Figure 3 shows the most frequent trigrams in the PDO text data. Here, phrases involving “health” recur again, together with a few phrases revolving around “environmental” goals and other terms that are expected to go together, like “water resource management”, “social safety net”, etc.

Figure 3

Sectors in the PDO text

To analyze a meaningful set of tokens, I examined the frequency of sector-related terms within the PDO text data. To capture the broader concept of “sector,” I created a comprehensive SECTOR variable that encompasses all relevant words within an expanded definition.

The “sector” term discussed here is not the sector variable available in the data, but an artificial construct reflecting the occurrence of terms referring to the same sector’s semantic field. Besides conceptual association, these definitions are rooted in the World Bank’s own classification of sectors and sub-sectors.

Below are the “broad SECTOR” definitions used in this analysis:

  • WAT_SAN = water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater
  • TRANSPORT = transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port
  • URBAN = urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town
  • ENERGY = energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation
  • HEALTH = health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|hiv|aids|tb|maternal|clinic|nutrition
  • EDUCATION = educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog
  • AGR_FOR_FISH = agricultural|agro|fish|forest|crop|livestock|fishery|land|soil
  • MINING_OIL_GAS = minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon
  • SOCIAL_PROT = social protec|social risk|social assistance|living standard|informality|insurance|social cohesion|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav
  • FINANCIAL = bank|finan|investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech
  • ICT = information|communication|ict|internet|telecom|cyber|data|ai|artificial intelligence|blockchain|e-learn|e-commerce|platform|software|hardware|digital
  • IND_TRADE_SERV = industry|trade|service|manufactur|tourism|trade and services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor
  • INSTIT_SUPP = government|public admin|institution|central agenc|sub-national gov|law|justice|governance|policy|regulation|public expenditure|public investment|public procurement
  • GENDER_EQUAL = gender|women|girl|woman|femal|gender equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance
  • CLIMATE = climate chang|environment|sustain|resilience|adaptation|mitigation|green|eco|eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab
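
As an illustration, a PDO can be flagged for one of these broad SECTORs with a single regular-expression match; the pattern below is the WAT_SAN definition from the list, and the two example PDOs are invented:

```r
# Flag PDOs matching the broad WAT_SAN sector definition (base R).
wat_san <- "water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater"

pdos <- c("to improve irrigation and drainage services for farmers",
          "to strengthen the investment climate")

grepl(wat_san, tolower(pdos))   # TRUE for the first PDO only
# TRUE FALSE
```

Stems such as “irrigat” and “drainag” deliberately match several word forms (“irrigation”, “irrigated”, “drainage”), which is why the definitions mix stems and full words.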

The occurrence trends over time for key sector terms are shown in Figure 4.

Interestingly, all the broadly defined “sector terms” in the PDOs present one or more peaks at some point in time. For the (broadly defined) HEALTH sector, it is likely that Covid-19 triggered the peak in 2020. What about the other sectors? What could be the driving reason?

Figure 4

A possible explanation is that the PDOs may echo themes from the World Development Reports (WDR), the World Bank’s flagship annual publication that analyzes a key development issue each year. Far from being speculative research, each WDR is grounded in the Bank’s field-based insights and, in turn, it informs the Bank’s policy and operational priorities. This would suggest a likely alignment between WDR themes and project objectives in the PDOs.

To some extent, visual exploration (see examples below) seems to support this hypothesis: thematically relevant WDRs consistently appear in close proximity to peaks in sector-related term frequencies. However, further validation is necessary. Additionally, preparing each WDR typically takes 2-3 years, so a temporal alignment with project documents may include some lag.

Examples of sectors-term trend

Figure 5 shows a “combined sector” that is quite broadly defined (AGRICULTURE, FORESTRY, FISHING) with the highest peak in 2010, two years after the publication of the WDR on “Agriculture for Development”. Perhaps the “alignment” hypothesis is not very meaningful with such a broadly defined sector.

Figure 5

Figure 6, tracking frequency of CLIMATE-related terms, shows how the highest peak coincided with the publication of the WDR on “Development and Climate Change” in 2010.

Figure 6

Figure 7 reports two WDR publications relevant to EDUCATION, which seemingly preceded two peaks in the sector-related terms in the PDOs:

  • in 2007, on “Development and the Next Generation”
  • in 2018, on “Learning to Realize Education’s Promise”

Figure 7

Figure 8 shows that the highest frequency of terms related to GENDER EQUALITY was instead recorded a couple of years before the publication of the WDR on “Gender Equality and Development” in 2012.

Figure 8

Comparing PDO text against variable sector

The available data includes not only text but also relevant metadata, such as the sector1 variable, which captures the project’s primary sector. Do the terms in the PDO text align with this sector label? To examine this, I applied the two-sample Kolmogorov-Smirnov test to compare the distribution of sector-related terms in the PDO text with the distribution of sector1.

As shown in Table 2, the results indicate similar distributions across most sectors. This is promising, as it suggests that in cases where metadata is lacking, sector assignments can be reasonably inferred from the PDO text.

Table 2: Comparing the frequency distributions of SECTOR in text and metadata
SECTORS KS statistic KS p-value Distributions
ENERGY 0.6522 0.0001 Dissimilar
HEALTH 0.3913 0.0487 Dissimilar
WAT_SAN 0.3913 0.0544 Similar
EDUCATION 0.3478 0.1002 Similar
ICT 0.2857 0.3399 Similar
MINING_OIL_GAS 0.3333 0.3442 Similar
TRANSPORT 0.2174 0.6410 Similar

The Kolmogorov-Smirnov Test (KS test) is a non-parametric test that compares the distribution of two samples. The null hypothesis is that the two samples are drawn from the same distribution. Notably, this test does not assume any particular underlying distribution. The test statistic is the maximum absolute difference between the two cumulative distribution functions. The p-value is the probability of observing a test statistic as extreme as the one computed, assuming that the null hypothesis is true.
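In R, this test is available out of the box as stats::ks.test; a minimal sketch with two made-up yearly frequency series (not the actual PDO data):

```r
# Two-sample Kolmogorov-Smirnov test with base R's stats::ks.test.
freq_text     <- c(5, 8, 12, 20, 15, 9, 7, 6, 10, 11)           # term counts by year (invented)
freq_metadata <- c(4.5, 9.5, 11.5, 21.5, 14.5, 8.5, 6.5, 7.2, 9.8, 12.5)  # sector1 counts (invented)

res <- ks.test(freq_text, freq_metadata)
res$statistic  # maximum distance between the two empirical CDFs
res$p.value    # a large p-value -> no evidence the distributions differ
```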

Below is a graphical representation of two illustrative sectors, showing the most similar and the most dissimilar distributions of the SECTOR as deduced from text data versus the metadata labeling.

Figure 9 shows the distributions of the TRANSPORT sector in the PDOs’ text and in the metadata. The two distributions are the most similar, as confirmed by the Kolmogorov-Smirnov test with a p-value of 0.641.

Figure 9

Figure 10 compares visually the distributions of the ENERGY sector in the PDOs’ text data and the metadata. The two distributions are the most dissimilar, as the Kolmogorov-Smirnov test confirms with a p-value of 0.0001.

Figure 10

Comparing PDO text against variable amount committed

Following up on the previous section, do trends observed in PDOs’ text also reflect the allocation of funds to specific sectors? I explored this question with the same approach as before, but this time I compared the distribution of sector-related terms in the PDOs’ text with the distribution of the sum of the amount committed in corresponding projects (filtered by sector1 category).

Given the very different ranges, I first rescaled both n (term counts) and sum_commit (committed amounts) to a [0, 1] range using Min-Max scaling, and then compared the normalized distributions with the Chi-Square test. This approach doesn’t assume normality and ensures both distributions are within the same bounds, though it doesn’t account for the shape of the distributions.
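The rescaling itself is a one-liner in base R; the two series below are invented stand-ins for n and sum_commit:

```r
# Min-Max scaling of two series with very different ranges to [0, 1] (base R).
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

n          <- c(10, 25, 40, 15)       # term counts (illustrative)
sum_commit <- c(1e6, 5e6, 9e6, 2e6)   # committed amounts (illustrative)

minmax(n)            # 0.0 0.5 1.0 0.1666667
minmax(sum_commit)   # both series now share the same [0, 1] bounds
```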

Let us pick a couple of examples of specific sectors to check visually.

ICT sector: words v. funding

The distributions in the ICT sector are THE most similar (K-S test p-value = 0.4218, the largest observed, giving no evidence of a difference).

Figure 11

WATER & SANITATION sector: words v. funding

The distributions in the “WATER & SANITATION” sector are THE least similar (K-S test p-value = 0.0001).

Figure 12

The chi-square test, in this case, serves to evaluate the similarity between the distribution of sector-related terms in the PDOs’ text and that of the amounts committed in the corresponding projects. The results suggest that the distributions are broadly similar across most sectors, although the visual comparisons show that they are not identical.

Concordances: a.k.a. keywords in context

Another useful analysis for exploring text data is concordance, which enables a closer look at the context surrounding a word (or combination of words). This approach can help clarify the word’s specific meaning or reveal underlying patterns in the data.

The bigram “eligible crisis” in the PDOs

For example, looking at most recurrent bigrams (two-word combinations) in the PDO text, the phrase “eligible crisis” caught my attention. It appears in the PDOs of 112 projects, and in 32% of these cases, it is accompanied by either “respond promptly and effectively” or “immediate and effective response”. Table 3 shows a few examples of what seems to be a recurring standard phrasing.
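A concordance (keyword-in-context, or KWIC) extraction of this kind can be sketched in base R; the `kwic` function and the sample sentence below are illustrative only:

```r
# Keyword-in-context: return a character window around each occurrence
# of `phrase` in `text` (base R, case-insensitive).
kwic <- function(text, phrase, width = 25) {
  text_l <- tolower(text)
  m <- gregexpr(phrase, text_l, fixed = TRUE)[[1]]
  if (m[1] == -1) return(character(0))        # phrase not found
  len <- attr(m, "match.length")
  mapply(function(pos, l) {
    substr(text_l, max(1, pos - width), min(nchar(text_l), pos + l + width - 1))
  }, m, len)
}

pdo <- "To respond promptly and effectively in the event of an Eligible Crisis or Emergency."
kwic(pdo, "eligible crisis")
```

Packages such as quanteda provide a more complete `kwic()` with token-based windows; this sketch only shows the underlying idea.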

Table 3: Context of the bigram “eligible crisis” in the PDOs
WB Project ID Excerpt of PDO Sentences with 'Eligible Crisis'
P179499 (...) and effective response in the case of an eligible crisis or emergency.
P176608 (...) promptly and effectively in the event of an eligible crisis or emergency.
P151442 (...) assistance programs and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P177329 (...) eligible crisis or emergency, respond promptly and effectively to it.
P127338 (...) capacity to respond promptly and effectively in an eligible crisis or emergency, asrequired.
P158504 (...) immediate and effective response in case of an eligible crisis or emergency.
P173368 (...) immediate and effective response in case of an eligible crisis or emergency in the kingdom of cambodia.
P178816 (...) the project regions and to respond to an eligible crisis
P160505 (...) theproject area, and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P149377 (...) mozambique to respond promptly and effectively to an eligible crisis or emergency.

The bigram “climate change” in the PDOs

Another frequently occurring bigram is “climate change”, found in 92 PDOs. Table 4 displays words that commonly appear near this bigram. Notably, the word “mitigation”—which I associate with a more aspirational, long-term response—appears more frequently than “adaptation”, which I view as a more practical, short-term response. However, the term “resilience” may convey a similar practical intent.

Table 4: Frequent words near “climate change”
Near 'climate change' Count Percentage
vulnerability 25 39.1%
mitigate 14 21.9%
resilience 14 21.9%
adapt 6 9.4%
hazard 5 7.8%

Table 5 shows examples, with highlighted words in the vicinity of the phrase of interest.

Table 5: Context of the bigram “climate change” in the PDOs
Near word (root) WB Project ID Closest Text
adapt P090731 (...) pilot adaptation measures addressing primarily, the impacts of climate change on their natural resource base, focused on biodiversity
adapt P120170 (...) a multi-sectoral dpl to enhance climate change adaptation capacity is anticipated in the cps.
adapt P129375 (...) objectives of the project are to: (i) integrate climate change adaptation and disaster risk reduction across the recipient’s
hazard P174191 (...) and health-related hazards, including the adverse effects of climate change and disease outbreaks.
hazard P123896 (...) agencies to financial protection from losses caused by climate change and geological hazards.
hazard P117871 (...) buildings and infrastructure due to natural hazards or climate change impacts; and (b) increased capacity of oecs governments
mitig P074619 (...) to help mitigate global climate change through carbon emission reductions (ers) of 138,000 tco2e
mitig P164588 (...) institutional capacity for sustainable agriculture, forest conservation and climate change mitigation.
mitig P094154 (...) removing carbon from the atmosphere and to mitigateclimate change in general.
resil P154784 (...) to increase agricultural productivity and build resilience to climate change risks in the targeted smallholder farming and pastoralcommunities
resil P112615 (...) the resilience of kiribati to the impacts of climate change on freshwater supply and coastal infrastructure.
resil P157054 (...) to improve durability and enhance resilience to climate change
vulnerab P149259 (...) to measurably reduce vulnerability to natural hazards and climate change impacts in grenada and in the eastern caribbean
vulnerab P146768 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.
vulnerab P117871 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.

🟡 Metadata quality enhancement with ML predictive models

The idea is to predict the missing tags (sector, environmental risk category, etc.) in the World Bank project documents, using the text of the Project Development Objective (PDO) section as input data.

🟡 Using ML models to predict a missing feature

Remember that text data is SPARSE!

To predict a missing feature (e.g., sector) based on available features from text data, several supervised machine learning algorithms can be applied. Given a mixture of text and structured data, here are some suitable algorithms:

  • Logistic regression (LR) is a good starting point for binary classification tasks.
  • Random Forest (RF) is a robust algorithm that can handle a mix of data types.
  • Gradient Boosting Machine (GBM) is a powerful algorithm that can handle a mix of data types and is less prone to overfitting.
  • Deep Learning (DL) models, such as neural networks, can be used for more complex tasks, but they require more data and computational resources.
  • Naive Bayes (NB) is a simple algorithm that can be used for text data, but it assumes that the features are independent, which is not always the case with text data.
  • Support Vector Machine (SVM) is a powerful algorithm for text data, but it can be computationally expensive on larger datasets.

Steps of prediction

  1. label engineering Define what we want to predict (outcome variable, \(y\)), and its functional form (binary or multiclass, log form or not if numeric)
    • Deal with missing values in \(y\) (understand whether there are systematic reasons for missingness and, if so, how to address them) and deal with extreme values of \(y\) (a conservative approach is best)
  2. sample design Select the observations to use.
    • For high external validity it will have to be as close as possible to the population of interest (patterns of variables’ distribution etc.)
  3. feature engineering Define the input data (predictors, \(X\)) and their format (text, numeric, categorical)
    • Deal with missing values in \(X\) (understand, variable by variable, the reasons for missingness, and decide what to do: keep, impute value if numeric, drop the predictor?)
    • Select the most relevant predictors (which \(X\) to have and in which form). For text predictor data, there are specific NLP transformations that can be applied (e.g. tokenization, lemmatization, etc.)
    • In some cases, interactions between predictor variables make sense.
    • Alternative models can be built with fewer predictors in simpler form, to compare against others with more predictors in more complex form. Here domain knowledge + EDA are key to decide what to include and what to exclude.
  4. model selection It’s impossible to try all possible models (i.e. all possible choices of \(X\) variables to include, their possible functional form, and their possible interactions give too many combinations).
    • cross-validation is similar to the train-test method, which splits the data into training and test sets, but it does this multiple times (e.g., k-fold cross-validation with \(k = 10\) means 10 train-test splits) and helps select the best model without overfitting.
    • Here we do all the work described above (model building and best-model selection) in the work set. This will be further divided \(k\) times into \(k\) train-test splits; the holdout set is then used to evaluate the prediction itself.
  5. last_fit means that, once the best model(s) are selected, they are re-fit on the entire work set (training data) to obtain the final model.
  6. post-prediction diagnostic, lastly, serves to evaluate the model’s performance on the hold-out sample. Here we can:
    • evaluate the fit of the prediction (using MSE, RMSE, accuracy, ROC, etc. to summarize goodness of fit)
    • visualize the prediction interval around the prediction (for continuous \(y\)) or the confusion matrix (for discrete \(y\))
    • zoom in on the kinds of observations we care about the most, or look at the fit in certain sub-samples of the data (e.g. by sector, by year, etc.)
    • assess the external validity (the hold-out set helps, but it is not representative of all the “live data”)
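
The cross-validation logic in the steps above can be sketched in base R on the built-in mtcars data; the model and variables are placeholders, not those of the PDO analysis:

```r
# k-fold cross-validation (k = 5) of a simple linear model, base R only.
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold assignment

rmse_by_fold <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]               # k-1 folds to fit the model
  test  <- mtcars[folds == i, ]               # held-out fold to evaluate it
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))             # fold-level RMSE
})

mean(rmse_by_fold)   # cross-validated RMSE, averaged over the k folds
```

In practice the tidymodels ecosystem (`rsample::vfold_cv`, `tune`, `last_fit()`) automates this loop, which is presumably what the analysis uses.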

LASSO (Least Absolute Shrinkage and Selection Operator) is a sort of “add-on” to linear regression models which, by adding a penalty term, gets better predictions from regressions with many predictors: it selects a subset of the predicting variables, which helps to avoid overfitting. The output of the LASSO algorithm is the set of coefficient values for the predictors that are kept in the model. The penalty is weighted by \(\lambda\), a tuning parameter that can be adjusted to get the best model.
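In its standard formulation for a linear model, the LASSO estimate minimizes the residual sum of squares plus the \(\lambda\)-weighted penalty:

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert \right\}
\]

Larger values of \(\lambda\) shrink more coefficients exactly to zero, effectively performing variable selection; \(\lambda\) is typically chosen by cross-validation.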

🟡 Evaluating the performance of the preferred ML model to predict a missing feature

Conclusions and next steps

  • Evidently, this project was primarily a learning / proof-of-concept exercise, so I wasn’t concerned with in-depth analysis of the data, nor with maximizing ML models’ predictive performance. Nevertheless, this initial exploration demonstrated the potential of applying NLP techniques to unstructured text data to uncover valuable insights, such as:

    • detecting frequency trends of sector-specific language, and topics over time,
    • improving document classification and metadata tagging, via ML predictive models,
    • uncovering surprising patterns and relationships in the data, e.g. recurring phrases or topics,
    • triggering additional text-related questions that could lead to further research.
  • Next steps could include:

    • delving deeper into hypothetical explanations for the patterns observed, e.g. by combining NLP on this document corpus with other data sources (e.g. information on other WB official documents and policy statements);
    • exploring more advanced NLP techniques, such as Named Entity Recognition (NER), Structural Topic Modeling (STM), or BERTopic, to enhance the analysis and insights drawn from World Bank project documents.
  • A pain point in this type of work is efficiently retrieving input data from document corpora. Despite the World Bank’s generous “Access to Information” policy, programmatic access to its extensive text data resources is still quite hard (no dedicated API, various stale pages and broken links). This should be addressed, perhaps following the model of the World Development Indicators (WDI) data, which are much more accessible and well-curated.

  • Amid the ongoing hype around AI and Large Language Models (LLMs), this kind of analysis seems like yesterday’s news. However, I believe there is still a huge untapped potential for meaningful applications of NLP and text analytics in development studies, policy analysis, and other areas—which will be even more impactful if informed by domain knowledge.


Acknowledgements

Below are some valuable resources for learning and implementing NLP techniques, geared toward R programmers.

References

Engel, Claudia, and Scott Bailey. 2022. Text Analysis with R. https://cengel.github.io/R-text-analysis/.
Francom, Jerid. 2024. An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R. 1st ed. London: Routledge. https://doi.org/10.4324/9781003393764.
Future Mojo, dir. 2022. Natural Language Processing Demystified - YouTube. https://www.youtube.com/playlist?list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.
Heiss, Andrew. 2022. “Text.” Data Visualization Course. 2022. https://datavizs22.classes.andrewheiss.com/example/13-example/#sentiment-analysis.
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. First edition. Data Science Series. Boca Raton London New York: CRC Press. https://smltar.com/.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/.